ABSTRACT

In recent NIST evaluations on sentence boundary detection, a
single error metric was used to describe performance. Additional
metrics, however, are available for such tasks, in which a word stream
is partitioned into subunits. This paper compares alternative
evaluation metrics (the NIST error rate, classification error rate
per word boundary, precision and recall, ROC curves, DET curves,
precision-recall curves, and area under the curves) and discusses
advantages and disadvantages of each. Unlike many studies in
machine learning, we use real data for a real task. We find benefit from
using curves in addition to a single metric. Furthermore, we find
that data skew has an impact on metrics, and that differences among
different system outputs are more visible in precision-recall curves.
Results are expected to help us better understand evaluation metrics
that should be generalizable to similar language processing tasks.

Index  Terms —  sentence  boundary  detection,  precision,  recall, 
ROC  curve 

1.  INTRODUCTION 

Sentence boundary detection has received increasing attention in
recent years, as a way to enrich speech recognition output for better
readability and improved downstream natural language processing
[1, 2, 3]. Automatic sentence boundary detection itself was part of
the recent NIST rich transcription evaluations.¹ In addition, studies
have been conducted to evaluate the impact of sentence segmentation
on subsequent tasks, including speech translation, parsing, and
speech summarization [2, 4, 5].

In the NIST evaluations, system performance for sentence boundary
detection has been evaluated using an error rate (total number of
inserted and deleted boundaries, divided by the number of reference
boundaries). Research studies [2, 3] have also begun to look at the
ROC curve, DET curve, and F-measure. Of course, since the ultimate
goal is to aid downstream language processing tasks, a proper
way to evaluate sentence boundary detection would be to look at
the impact on the downstream tasks. In fact, in [2] it was shown
that the optimal segmentation for parsing is indeed different from
that obtained when optimizing for sentence boundary detection
accuracy using the aforementioned NIST metric. Other application-based
studies include the impact of sentence boundary detection for
machine translation [4] and for summarization [5].

This paper examines various evaluation metrics and discusses
their advantages and disadvantages. We focus on the boundary
detection task itself, rather than its impact on downstream
applications. In addition, we evaluate the effect of different priors of the
event of interest (i.e., sentence boundaries) by using different
corpora. Highly skewed priors are inherent to this and related tasks,
since boundary events are typically rare compared to nonboundaries.
Unlike most studies in machine learning, this work focuses on a real
language processing task. The study is expected to help us better
understand evaluation metrics that will be generalizable to many
similar language processing tasks involving segmentation of the speech
stream into subunits (for example, topic, story, or dialog act).

¹ See http://www.nist.gov/speech/tests/rt/rt2004/fall/ for more information
on NIST evaluations.

2.  METRICS 

The task of sentence boundary detection in speech is to determine
the location of boundaries given a word sequence (typically from a
speech recognizer) and associated audio signal. In this study, we
use reference transcriptions, to avoid confounding the discussion with
the effects of recognition errors themselves; the metrics for
recognition output apply similarly after aligning recognition output with
reference transcriptions. We can represent the task as two-way
classification or detection, that is, label each interword boundary as
either “sentence” or “no-sentence”. Table 1 shows the notation we use
for the confusion matrix result. For a given task, the total number
of samples is tp + fn + fp + tn, and the total number of positive
samples is tp + fn.


                     system true     system false
reference true           tp               fn
reference false          fp               tn

Table 1. A confusion matrix for the system output. “True” means
positive examples, that is, sentence boundaries in this task.
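
As a minimal sketch of how the four cells of this confusion matrix could be counted, the following Python function walks over aligned reference and system labels for each interword boundary. The function name, the boolean label encoding, and the toy label sequences are assumptions made for illustration; they are not taken from the evaluation tools used in this study.

```python
from collections import namedtuple

# Cell names follow the notation of the confusion matrix (tp, fn, fp, tn).
Confusion = namedtuple("Confusion", ["tp", "fn", "fp", "tn"])


def confusion_counts(reference, hypothesis):
    """Count tp, fn, fp, tn over aligned interword boundary labels.

    Both arguments are equal-length sequences of booleans, where True
    marks a sentence boundary (the positive class).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must be aligned")
    tp = fn = fp = tn = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref and hyp:
            tp += 1      # boundary correctly detected
        elif ref:
            fn += 1      # boundary missed (deletion)
        elif hyp:
            fp += 1      # boundary hypothesized where none exists (insertion)
        else:
            tn += 1      # non-boundary correctly left alone
    return Confusion(tp, fn, fp, tn)


# Hypothetical toy labels, one per interword boundary.
reference = [False, False, True, False, True, False, False, True]
hypothesis = [False, True, True, False, False, False, False, True]
counts = confusion_counts(reference, hypothesis)
print(counts)
```

The total number of samples is then `sum(counts)` and the number of positive samples is `counts.tp + counts.fn`.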

2.1.  Metrics  Description 

Various metrics have been used for evaluating sentence boundary
detection or similar tasks, in individual studies. For example, in
[6, 7], metrics are developed that treat the sentences as units and
measure whether the reference and hypothesized sentences match
exactly. Slot error rate [8] was introduced first for information
extraction tasks, and later used for sentence boundary detection. Kappa
statistics have often been used to evaluate human annotation
consistency, and can also be used to evaluate system performance, that is,
treating system output as a ‘human’ annotation. Other metrics in
the general classification literature, such as cost curves [9], have not
been widely used for evaluating sentence boundary detection.

In  this  paper  we  focus  on  the  following  metrics: 

•  NIST metric. The NIST error rate is the sum of the insertion
and deletion errors divided by the number of reference sentence
boundaries. Using the notation in Table 1, this becomes

$$\text{NIST error rate} = \frac{fn + fp}{tp + fn}$$

Note that the NIST evaluation tool md-eval² allows boundaries
within a small window to match up, in order to take into
account different alignments from speech recognizers. We
ignore that detail in this study and simply treat the task as
straightforward classification.

•  Classification error rate. If this task is represented as a
classification task for each interword boundary point, then the
classification error rate (CER) is

$$\text{CER} = \frac{fn + fp}{tp + fn + fp + tn}$$


•  Precision and recall. These metrics are widely used in
information retrieval, and are defined as follows:

$$\text{precision} = \frac{tp}{tp + fp}, \qquad \text{recall} = \frac{tp}{tp + fn}$$

A single metric is often used to reflect both precision and
recall, and their tradeoff:

$$\text{F-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

•  ROC curve. Receiver operating characteristic (ROC) curves
are used for decision making in many detection tasks. They
show the relationship between true positives ($= \frac{tp}{tp + fn}$) and
false positives ($= \frac{fp}{fp + tn}$) as the decision threshold varies.

•  PR curve. The precision-recall (PR) curve shows what
happens to precision and recall as the decision threshold is
varied.

•  DET curve. A detection error tradeoff (DET) curve plots the
miss rate ($= \frac{fn}{tp + fn}$) versus the false alarms (i.e., false
positive), using the normal deviate scale [10]. It is widely used in
the task of speaker recognition, but less used for other
classification problems.

•  AUC. The curves above provide a good view of the system’s
performance at different decision points. However, a single
number is often preferred when comparing two curves or two
models. Area under the curves (AUC) is used for this
purpose. It is often used for both ROC and PR curves, but less so
for DET curves.³

2.2. Relationship

For a given task, the number of positive samples (i.e., np = tp + fn)
and the total number of samples (i.e., tp + fn + fp + tn) are fixed.
Therefore, precision and recall uniquely determine the confusion
matrix, and hence the NIST error rate and classification error rate.
Each of the two error rates uniquely determines the other, since
they share a numerator and their denominators are proportional.
However, from the two error rates (without detailed information
about insertion or deletion errors), we cannot infer the precision
and recall rate.

The ROC and PR curves are one-to-one mapping curves. Each
point in one curve uniquely determines the confusion matrix, and
thus the point in the other curve. For the ROC and PR curves, it has
been shown that if a curve is dominant in one space, then it is also
dominant in the other [11]. Such a relationship also holds for the
ROC and DET curves. This is straightforward from the definition
of these curves: true positive versus false positive in ROC curves,
and miss rate (i.e., 1 − true positive) versus false positive on the
scale of the normal deviate in DET curves. Since the normal deviate
is a monotonic function, changing the axis to a normal deviate scale
preserves the property of being dominant.

² The scoring tool is available from
http://www.nist.gov/speech/tests/rt/rt2004/fall/tools/.

³ For DET curves, single metrics such as the EER (equal error rate) and
DCF (detection cost function) are often used in speaker recognition.

3. ANALYSIS ON RT04 DATA SET

3.1. Sentence Boundary Detection Task Setup


To study the behavior of different metrics on real data, we evaluated
sentence boundary detection for a state-of-the-art system on two
different corpora. We used the RT04 NIST evaluation data,
conversational telephone speech (CTS) and broadcast news speech (BN).
The total number of words in the test set is about 4.5K in BN and
3.5K in CTS. The prior probability of a sentence boundary is
different across the two corpora, about 14% for CTS and 8% for BN.⁴
Comparing the two corpora allows us to investigate the effect of data
skew differences on the metrics.

System output is based on the ICSI+SRI+UW sentence boundary
detection system [3]. Five different system outputs are used
in this study: decision tree classifiers using prosodic features only,
4-gram language model (LM), hidden Markov model (HMM) that
combines prosody and language model, maximum entropy (Maxent)
model using prosodic and textual information, and the combination
of HMM and Maxent.⁵ For all these approaches, there is a posterior
probability generated for each interword boundary, which we use to
plot the curves or to set the decision threshold for a single metric.

3.2.  Results  and  Discussion 

Table 2 shows different single performance measures for sentence
boundary detection for CTS and BN. A threshold of 0.5 is used to
generate the hard decision for each boundary point. We used the
recognizer forced alignment output (slightly different from the original
transcripts) as the input to sentence boundary detection. Reference
boundaries were obtained by matching the original sentence
boundaries to the alignment output.
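
The step from posterior probabilities to the single measures of Table 2 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the scoring code used in the study: the function name `single_metrics`, the choice to treat a posterior equal to the threshold as a boundary, and the toy posteriors are all hypothetical.

```python
def single_metrics(posteriors, reference, threshold=0.5):
    """Threshold boundary posteriors and compute the single metrics.

    posteriors: posterior probability of a boundary at each interword point.
    reference:  booleans, True where the reference has a sentence boundary.
    Assumes the reference contains at least one boundary.
    """
    tp = fn = fp = tn = 0
    for posterior, is_boundary in zip(posteriors, reference):
        # Assumption: a posterior equal to the threshold counts as a boundary.
        predicted = posterior >= threshold
        if is_boundary and predicted:
            tp += 1
        elif is_boundary:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "nist_error_rate": (fn + fp) / (tp + fn),
        "cer": (fn + fp) / (tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical posteriors and reference labels for ten interword boundaries.
posteriors = [0.9, 0.2, 0.6, 0.1, 0.4, 0.05, 0.7, 0.3, 0.1, 0.2]
reference = [True, False, True, False, True,
             False, False, False, False, False]
metrics = single_metrics(posteriors, reference)
print(metrics)
```

Changing `threshold` moves the operating point, which is why each single metric describes only one point on the curves.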

Figure 1 shows the ROC, PR, and DET curves for the five system
outputs, for both CTS and BN. The points shown in the PR curves
correspond to using 0.5 as the decision threshold (i.e., the results
shown in Table 2). The points for HMM, Maxent, and their
combination are close to each other, and thus are not individually labeled
with arrows.
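
Curves of this kind are obtained by sweeping the decision threshold over the posterior probabilities. The sketch below shows one simple way to generate ROC and PR points; it assumes each distinct posterior value is used as a threshold and that both classes occur in the reference. The function name and the toy data are hypothetical, and no plotting is included.

```python
def curve_points(posteriors, reference):
    """Sweep the decision threshold and return (roc, pr) point lists.

    roc points are (false positive rate, true positive rate);
    pr points are (recall, precision).
    Assumes the reference contains both boundaries and non-boundaries.
    """
    n_pos = sum(1 for r in reference if r)
    n_neg = len(reference) - n_pos
    roc = [(0.0, 0.0)]          # threshold above every posterior
    pr = []
    for threshold in sorted(set(posteriors), reverse=True):
        tp = sum(1 for p, r in zip(posteriors, reference)
                 if p >= threshold and r)
        fp = sum(1 for p, r in zip(posteriors, reference)
                 if p >= threshold and not r)
        roc.append((fp / n_neg, tp / n_pos))
        # tp + fp > 0 here, since at least one posterior meets the threshold.
        pr.append((tp / n_pos, tp / (tp + fp)))
    return roc, pr


# Hypothetical posteriors and reference labels.
posteriors = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
reference = [True, False, True, False, True, False]
roc, pr = curve_points(posteriors, reference)
for point in roc:
    print("ROC", point)
for point in pr:
    print("PR ", point)
```

This quadratic sweep favors clarity over speed; for large test sets a single pass over the sorted posteriors would give the same points.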

In Table 2, for almost all the cases (except recall on CTS), the
combination of HMM and Maxent achieves the best performance.
The curves also show that generally HMM, Maxent, and their
combination are close to each other, and much better than the other two
curves for the prosody and LM, for both CTS and BN. However, in
this study, our goal is not to determine the best model to optimize a
single performance metric. We are more interested in looking at
different system outputs and how they behave with respect to evaluation
metrics.

Table 2. Different performance measures for sentence boundary detection in CTS and BN. The decision threshold is 0.5.

Fig. 1. ROC, PR, and DET curves for CTS and BN from five different systems: Prosody, LM, HMM, Maxent, and the combination of HMM
and Maxent.

⁴ In the EARS program, each sentence-like unit was called an “SU” [12].

⁵ Details of the modeling approaches can be found in [3].

3.2.1. Domain and metrics

BN and CTS have different speaking styles and class distributions
(priors of sentence boundaries), and thus comparisons across the two
domains using a single metric may not be informative. For example,
the CER is similar across the two domains (for HMM, Maxent, and
their combination), but to some extent that is because of the higher
degree of skew for BN than for CTS. Using other metrics such as the
NIST error rate and precision/recall can better reflect inherent
performance differences. As expected, for imbalanced data sets where
the majority is negative examples, ROC curves show weakness in
distinguishing different classifiers or comparing across tasks, since
the large number of negative examples (i.e., tn + fp) often results
in small differences between the false positive rates. The AUC for
the ROC curves is quite high for both BN and CTS, whereas in the
PR space the difference between BN and CTS is more noticeable.
The PR curves and the associated AUC values are much worse in
BN than CTS. For the imbalanced data, PR curves often have
advantages in exposing the difference between algorithms, as shown in
Figure 1. DET curves also better illustrate the difference between
the curves across the two corpora (e.g., the slopes of the curves).
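
The effect of skew described above can be illustrated with a small calculation. The sketch below holds a single ROC operating point fixed and varies only the boundary prior, using the approximate priors reported for CTS (14%) and BN (8%). The true positive rate of 0.8 and false positive rate of 0.05 are hypothetical values chosen for illustration, not results from this study.

```python
def metrics_from_rates(tpr, fpr, prior):
    """Express precision, CER, and NIST error rate through an ROC point.

    tpr, fpr: true and false positive rates (one point on the ROC curve).
    prior:    fraction of interword boundaries that are sentence boundaries.
    All cell values below are fractions of the total number of samples.
    """
    tp = tpr * prior
    fn = (1.0 - tpr) * prior
    fp = fpr * (1.0 - prior)
    precision = tp / (tp + fp)
    cer = fn + fp              # errors per interword boundary
    nist = cer / prior         # errors per reference sentence boundary
    return precision, cer, nist


# Same hypothetical ROC operating point, two different priors.
cts = metrics_from_rates(tpr=0.8, fpr=0.05, prior=0.14)
bn = metrics_from_rates(tpr=0.8, fpr=0.05, prior=0.08)
print("CTS precision, CER, NIST:", cts)
print("BN  precision, CER, NIST:", bn)
```

Under these assumptions the two CER values stay close, while precision drops and the NIST error rate rises for the more skewed prior. This is consistent with the observation that PR-based measures expose differences that the ROC space and the CER tend to hide.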

3.2.2.  Interaction  among  domains,  models,  and  metrics 

There is some difference between models across the two domains.
For BN, using only the prosody model performs similarly to or slightly
better than the LM alone, in terms of error rate, precision, and
recall. However, the AUC values for the prosody model are worse
than those for the LM, for both ROC and PR curves. As shown from
the PR curves, in the region around the decision threshold (and also
the region to the left, i.e., with lower recall), the prosody curve is
better than the LM, but not in other regions. Therefore, using the
curves helps to determine what model or system output is better for
the region of interest. For BN, the PR curves for the prosody model
and the LM cross in the middle, but this is not so for CTS, where
the LM alone achieves better performance than prosody alone, using
most of the metrics (except precision, as shown in Table 2).

3.2.3.  Single  metrics  versus  curves 

Table 2 shows that the different measurements for this sentence
boundary task are highly correlated within one corpus: an algorithm
that is better than another on one single metric is often better on
many of them. However, one single metric does not provide all the
information, since it is the measure for one particular chosen
decision point. As described earlier, the NIST error rate and CER
cannot determine the confusion matrix, or precision and recall, as
they combine insertion and deletion errors (although that
information can be available). For downstream processing, if a
different decision region is preferable, using the curves will
easily expose such information as which model performs
better. For example, [2] shows that the optimal point for parsing
is different from that chosen to optimize the single NIST error rate
(intuitively, shorter utterances are more appropriate for parsing).
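
The point that the two error rates do not determine precision and recall, while precision and recall do determine the confusion matrix, can be checked on a small example. The two confusion matrices below are hypothetical and share the same number of reference boundaries (10) and samples (100); the helper names are likewise assumptions made for this sketch.

```python
def error_rates(tp, fn, fp, tn):
    """Return (NIST error rate, CER) for one confusion matrix."""
    nist = (fn + fp) / (tp + fn)
    cer = (fn + fp) / (tp + fn + fp + tn)
    return nist, cer


def precision_recall(tp, fn, fp, tn):
    """Return (precision, recall) for one confusion matrix."""
    return tp / (tp + fp), tp / (tp + fn)


def confusion_from_pr(precision, recall, n_pos, n_total):
    """Recover (tp, fn, fp, tn) from precision, recall, and the fixed totals."""
    tp = round(recall * n_pos)
    fn = n_pos - tp
    fp = round(tp * (1.0 - precision) / precision)
    tn = n_total - n_pos - fp
    return tp, fn, fp, tn


# Two hypothetical systems, each making six errors in total.
system_a = (8, 2, 4, 86)   # mostly insertions
system_b = (5, 5, 1, 89)   # mostly deletions

print(error_rates(*system_a), error_rates(*system_b))
print(precision_recall(*system_a), precision_recall(*system_b))
print(confusion_from_pr(*precision_recall(*system_a), 10, 100))
```

Both systems receive identical NIST error rates and CERs, yet their precision and recall differ, so the error rates alone cannot say whether a system errs toward insertions or deletions.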

For the PR, ROC, and DET curves, from the discussion in
Section 2, we know that the dominance of a curve in one space implies
dominance in other spaces. Additionally, if a curve for one algorithm
is dominant over another one, then the AUC is greater. However,
a better AUC for one curve does not mean that the curve is dominant.
Similarly, the AUC comparison for the PR and
ROC curves can be different. For example, comparing HMM and
Maxent on both corpora, Maxent has better AUC in the PR space
(not very significant), but not in ROC, as shown in Table 2.
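
A short sketch makes the distinction between a larger AUC and dominance concrete. It uses trapezoidal integration over two hypothetical ROC curves sampled at the same false positive rates; the curve values are invented for illustration and the dominance check assumes a shared x grid.

```python
def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def dominates(curve_a, curve_b):
    """True if curve_a is at least as high as curve_b at every shared x.

    Assumes both curves are sampled at the same x values.
    """
    return all(ya >= yb for (_, ya), (_, yb) in zip(curve_a, curve_b))


# Hypothetical ROC curves: (false positive rate, true positive rate).
curve_a = [(0.0, 0.0), (0.1, 0.80), (0.5, 0.85), (1.0, 1.0)]
curve_b = [(0.0, 0.0), (0.1, 0.60), (0.5, 0.95), (1.0, 1.0)]

print("AUC a:", auc(curve_a))
print("AUC b:", auc(curve_b))
print("a dominates b:", dominates(curve_a, curve_b))
print("b dominates a:", dominates(curve_b, curve_a))
```

Here the first curve has the larger area, but the curves cross, so neither dominates and the preferred system depends on the operating region of interest.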

In many cases, curves for different algorithms cross each other;
therefore, it is not easy to conclude that one classifier outperforms
the other. The decision is often based on downstream applications
(e.g., improving readability, input to machine translation, or
information extraction). In this situation, using the curves along with
a single-value measurement is a better idea. For visualization, PR
curves expose information better than ROC, especially for
imbalanced data sets. DET curves are also easier to visualize than ROC
curves, and more effectively show differences between algorithms.
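
For readers who want to reproduce the DET view, the following sketch maps ROC points onto the normal deviate scale using the inverse cumulative distribution function of the standard normal, available in the Python standard library as `statistics.NormalDist.inv_cdf`. The helper names and the sample points are hypothetical; points with a rate of exactly 0 or 1 are skipped because the normal deviate is unbounded there.

```python
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def normal_deviate(rate):
    """Map an error rate in the open interval (0, 1) to the normal deviate scale."""
    return _STANDARD_NORMAL.inv_cdf(rate)


def det_points(roc_points):
    """Convert ROC points (false positive rate, true positive rate) to DET points.

    Each DET point is (deviate of false alarm rate, deviate of miss rate).
    """
    det = []
    for fpr, tpr in roc_points:
        miss = 1.0 - tpr
        # inv_cdf is undefined at 0 and 1, so the curve end points are dropped.
        if 0.0 < fpr < 1.0 and 0.0 < miss < 1.0:
            det.append((normal_deviate(fpr), normal_deviate(miss)))
    return det


# Hypothetical ROC points.
roc_points = [(0.0, 0.0), (0.1, 0.8), (0.5, 0.9), (1.0, 1.0)]
for false_alarm, miss in det_points(roc_points):
    print(false_alarm, miss)
```

Because the mapping is monotonic, the ordering of two systems at any fixed false alarm rate is the same on the DET plot as on the ROC plot; only the spacing of the axes changes.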

4.  CONCLUSIONS 

We used a real-world spoken language processing task to compare
different performance metrics for sentence boundary detection. While
this study is based on a particular sentence boundary detection
system and posterior probability distribution, the focus was on
general comparison of alternative metrics, rather than on performance
specifics. We examined single metrics, including the NIST error
rate, classification error rate, precision, recall, and AUC, as well as
decision curves (ROC, PR, and DET). Single metrics provide
limited information; decision curves illustrate which model is best for
a specific region and should be preferable for downstream language
processing. Furthermore, data skew has an impact on metrics. For an
imbalanced data set, PR curves generally provide better visualization
than do ROC curves, for viewing differences among different
algorithms. Finally, while the analysis in this paper is based on sentence
boundary detection, the nature of this task is similar to many other
language processing applications (e.g., story segmentation). Hence,
findings should be generalizable to other similar tasks.